In this example I will build a classifier for churn prediction using a dataset from the telecom industry. You can find the dataset on GitHub at the following link.
https://github.com/abulbasar/data/tree/master/Churn%20prediction
There are two files: churn-bigml-80.csv (the training set) and churn-bigml-20.csv (the test set).
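If you prefer not to download the files manually, pandas can also read them straight from GitHub's raw-content host. The raw URLs below are an assumption derived from the repository link above, not verified against the live repository.
import pandas as pd
# Assumed raw-content URLs, derived from the GitHub repository path above.
base = "https://raw.githubusercontent.com/abulbasar/data/master/Churn%20prediction"
df_train = pd.read_csv(base + "/churn-bigml-80.csv")
df_test = pd.read_csv(base + "/churn-bigml-20.csv")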
In [1]:
import xgboost as xgb
import pandas as pd
from sklearn import *
import matplotlib.pyplot as plt
%matplotlib inline
Load the training data
In [2]:
df_train = pd.read_csv("/data/churn-bigml-80.csv")
df_train.head()
Out[2]:
Let's check the number of records, the number of columns, the column types, and whether the data contains NULL values.
As we see, it contains 2665 records and 20 columns, with no null values. There are three categorical (object-typed) columns.
In [3]:
df_train.info()
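info() reports the non-null count per column; an explicit check, not in the original notebook, makes the no-nulls claim easy to verify:
# Total count of missing values across all columns; should be 0 here.
df_train.isnull().sum().sum()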
Let's check the distribution of the output class. As it shows, about 85% of the records are negative. This gives a sense of the desired accuracy: a useful model should score noticeably above the 85% majority-class baseline, i.e. closer to 90% or more.
In [4]:
df_train.Churn.value_counts()
Out[4]:
In [5]:
df_train.Churn.value_counts()/len(df_train)
Out[5]:
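The majority-class baseline can also be computed directly; a classifier that always predicts the majority class achieves exactly this accuracy, so any model we build should beat it.
# Fraction of the most frequent class = accuracy of a majority-class predictor.
df_train.Churn.value_counts(normalize=True).max()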
In [6]:
df_train.columns
Out[6]:
Load the test data and perform the same analysis as before. The test set is about 25% of the training set's size, consistent with the 80/20 split implied by the file names.
In [7]:
df_test = pd.read_csv("/data/churn-bigml-20.csv")
df_test.info()
In [8]:
df_test.Churn.value_counts()/len(df_test)
Out[8]:
In [9]:
len(df_test)/len(df_train)
Out[9]:
Separate the categorical and numeric columns so that they can be passed to a pipeline for the preprocessing steps. In the preprocessing steps, we impute missing values, one-hot encode the categorical columns, and standardize the numeric columns.
Although Area code is numeric, I am treating it as categorical since it is qualitative in nature.
In [10]:
cat_columns = ['State', 'Area code', 'International plan', 'Voice mail plan']
num_columns = ['Account length', 'Number vmail messages', 'Total day minutes',
'Total day calls', 'Total day charge', 'Total eve minutes',
'Total eve calls', 'Total eve charge', 'Total night minutes',
'Total night calls', 'Total night charge', 'Total intl minutes',
'Total intl calls', 'Total intl charge', 'Customer service calls']
In [11]:
target = "Churn"
X_train = df_train.drop(columns=target)
y_train = df_train[target]
X_test = df_test.drop(columns=target)
y_test = df_test[target]
In [12]:
cat_pipe = pipeline.Pipeline([
('imputer', impute.SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', preprocessing.OneHotEncoder(handle_unknown='error', drop="first"))
])
num_pipe = pipeline.Pipeline([
('imputer', impute.SimpleImputer(strategy='median')),
('scaler', preprocessing.StandardScaler()),
])
preprocessing_pipe = compose.ColumnTransformer([
("cat", cat_pipe, cat_columns),
("num", num_pipe, num_columns)
])
X_train = preprocessing_pipe.fit_transform(X_train)
X_test = preprocessing_pipe.transform(X_test)
In [13]:
pd.DataFrame(X_train.toarray()).describe()
Out[13]:
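fit_transform returns a sparse matrix because of the one-hot encoding, which is why toarray() is needed above. A quick shape check, not in the original notebook, shows how many features the preprocessing produced:
# Rows are unchanged; columns expand from the 19 raw features to the one-hot-encoded width.
X_train.shape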
Build basic logistic regression and decision tree models and check their accuracy. The basic logistic regression model gives an accuracy of 85%.
In [14]:
est = linear_model.LogisticRegression(solver="liblinear")
est.fit(X_train, y_train)
y_test_pred = est.predict(X_test)
est.score(X_test, y_test)
Out[14]:
In [15]:
est = tree.DecisionTreeClassifier(max_depth=6)
est.fit(X_train, y_train)
y_test_pred = est.predict(X_test)
est.score(X_test, y_test)
Out[15]:
Print the classification report. The report shows that precision and recall for the positive class are quite poor. Accuracy is 85%. The confusion matrix shows a high number of false positives and false negatives.
In [16]:
print(metrics.classification_report(y_test, y_test_pred))
In [17]:
metrics.confusion_matrix(y_test, y_test_pred)
Out[17]:
Next, we build a similar model using XGBoost. The performance of this model is slightly better than that of the logistic regression model.
In [43]:
eval_sets = [
(X_train, y_train),
(X_test, y_test)
]
cls = xgb.XGBRFClassifier(silent=False,
scale_pos_weight=1,
learning_rate=0.1,
colsample_bytree = 0.99,
subsample = 0.8,
objective='binary:logistic',
n_estimators=100,
reg_alpha = 0.003,
max_depth=10,
gamma=10,
min_child_weight = 1
)
print(cls.fit(X_train
, y_train
, eval_set = eval_sets
, early_stopping_rounds = 10
, eval_metric = ["error", "logloss"]
, verbose = True
))
print("test accuracy: " , cls.score(X_test, y_test))
In [19]:
cls.evals_result()
Out[19]:
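Because the model was fit with an eval_set, the recorded metrics can be plotted to see how train and test logloss evolved and where early stopping kicked in. A minimal sketch, assuming the two evaluation sets are reported under the default keys validation_0 (train) and validation_1 (test):
results = cls.evals_result()
# validation_0 = first eval_set entry (train), validation_1 = second (test).
plt.plot(results['validation_0']['logloss'], label='train')
plt.plot(results['validation_1']['logloss'], label='test')
plt.xlabel('boosting round')
plt.ylabel('logloss')
plt.legend()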
In [20]:
y_test_pred = cls.predict(X_test)
In [21]:
metrics.confusion_matrix(y_test, y_test_pred)
Out[21]:
In [22]:
y_test_prob = cls.predict_proba(X_test)[:, 1]
y_test_prob
Out[22]:
In [23]:
auc = metrics.roc_auc_score(y_test, y_test_prob)
auc
Out[23]:
In [24]:
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_test_prob)
In [25]:
plt.rcParams['figure.figsize'] = 8,8
plt.plot(fpr, tpr)
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC, auc: " + str(auc))
Out[25]:
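The thresholds array returned by roc_curve is unused above, but it is handy for picking an operating point. One common heuristic, added here as an illustration rather than part of the original notebook, is Youden's J statistic (tpr - fpr):
import numpy as np
# Choose the threshold that maximizes tpr - fpr (Youden's J).
best = np.argmax(tpr - fpr)
thresholds[best], tpr[best], fpr[best]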
Cross-validate the model
In [26]:
params = { 'objective': "binary:logistic"
, 'colsample_bytree': 0.9
, 'learning_rate': 0.01
, 'max_depth': 10
, 'alpha': 0.5
, 'min_child_weight': 1
, 'subsample': 1
, 'eval_metric': "auc"
}
# Note: n_estimators and verbose are not booster parameters; num_boost_round
# and verbose_eval below control the number of rounds and the logging instead.
data_dmatrix = xgb.DMatrix(data=X_train, label=y_train)
cv_results = xgb.cv(dtrain=data_dmatrix
, params=params
, nfold=5
, maximize=True  # the final metric, "auc", should be maximized; maximize expects a bool
, num_boost_round=100
, early_stopping_rounds=10
, metrics=["logloss", "error", "auc"]
, as_pandas=True
, seed=123
, verbose_eval=True
)
cv_results
cv_results
Out[26]:
In [27]:
cv_results[["train-error-mean"]].plot()
Out[27]:
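The plot above tracks only the training error. cv_results also holds the held-out fold metrics, so plotting train and test AUC together gives a quick overfitting check:
# Compare mean AUC on the training folds vs. the held-out folds.
cv_results[["train-auc-mean", "test-auc-mean"]].plot()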
Install graphviz to display the decision tree graph
$ conda install graphviz python-graphviz
In [28]:
plt.rcParams['figure.figsize'] = 50,50
xgb.plot_tree(cls, num_trees=0, rankdir='LR')
Out[28]:
These plots provide insight into how the model arrived at its final decisions and what splits it made to arrive at those decisions.
Note that if the above plot throws a 'graphviz' error on your system, consider installing the graphviz Python package via pip install graphviz. You may also need to install the system package, e.g. sudo apt-get install graphviz.
In [29]:
plt.rcParams['figure.figsize'] =15, 15
xgb.plot_importance(cls, )
Out[29]:
In [30]:
cls.feature_importances_
Out[30]:
In [31]:
one_hot_encoder = preprocessing_pipe.transformers_[0][1].steps[1][1]
one_hot_encoder
Out[31]:
In [32]:
one_hot_encoder.get_feature_names()
Out[32]:
In [33]:
preprocessing_pipe.transformers_[0][1]
Out[33]:
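Putting the previous steps together, the encoder's generated names followed by the numeric column list line up with the model's feature importances. A minimal sketch, assuming the ColumnTransformer emits the categorical block first and the numeric block second (the order they were declared in):
# Feature names in the order the ColumnTransformer produces them.
feature_names = list(one_hot_encoder.get_feature_names()) + num_columns
importances = pd.Series(cls.feature_importances_, index=feature_names)
importances.sort_values(ascending=False).head(10)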
In [34]:
parameters = {
'max_depth': range(2, 10, 1),
'n_estimators': range(60, 220, 40),
'learning_rate': [0.1, 0.01, 0.05]
}
cls = xgb.XGBRFClassifier(silent=False,
scale_pos_weight=1,
learning_rate=0.01,
colsample_bytree = 0.99,
subsample = 0.8,
objective='binary:logistic',
n_estimators=100,
reg_alpha = 0.003,
max_depth=10,
gamma=10,
min_child_weight = 1
)
grid_search = model_selection.GridSearchCV(
estimator=cls,
param_grid=parameters,
scoring = 'roc_auc',
n_jobs = 12,
cv = 10,
verbose=True,
return_train_score=True
)
grid_search.fit(X_train, y_train)
Out[34]:
In [35]:
grid_search.best_estimator_
Out[35]:
In [36]:
grid_search.best_params_
Out[36]:
In [37]:
grid_search.best_score_
Out[37]:
In [38]:
pd.DataFrame(grid_search.cv_results_)
Out[38]:
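A hypothetical follow-up, not in the original notebook: since GridSearchCV refits the best estimator on the full training set by default, it can be scored directly on the held-out test set to confirm the tuned model generalizes.
best = grid_search.best_estimator_
print("test accuracy:", best.score(X_test, y_test))
print("test AUC:", metrics.roc_auc_score(y_test, best.predict_proba(X_test)[:, 1]))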
In [39]:
folds = 5
param_comb = 5
cls = xgb.XGBRFClassifier(silent=False,
scale_pos_weight=1,
learning_rate=0.01,
colsample_bytree = 0.99,
subsample = 0.8,
objective='binary:logistic',
n_estimators=100,
reg_alpha = 0.003,
max_depth=10,
gamma=10,
min_child_weight = 1
)
skf = model_selection.StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)
random_search = model_selection.RandomizedSearchCV(cls,
param_distributions=parameters,
n_iter=param_comb,
scoring='accuracy',
n_jobs=12,
cv=skf.split(X_train,y_train),
verbose=3,
random_state=1001 )
random_search.fit(X_train, y_train)
Out[39]:
In [40]:
random_search.best_score_, random_search.best_params_
Out[40]:
In [41]:
pd.DataFrame(random_search.cv_results_)
Out[41]:
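As with the grid search, the randomized search's refit best estimator can be checked against the test set; a short sketch along the same lines as above:
best = random_search.best_estimator_
print("test accuracy:", best.score(X_test, y_test))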